
[trainer/deepspeed] load_best_model (reimplement re-init) #17151

Merged
stas00 merged 8 commits into main from ds-load_best_model
Jun 2, 2022

Conversation

@stas00
Contributor

@stas00 stas00 commented May 10, 2022

This PR fixes #17114

The deepspeed_reinit hack proved not to work in all cases: some of the stored args turn out to be stale or wrong (e.g. the optimizer can end up being a DeepSpeed outer optimizer, which shouldn't be the case), so this PR instead does a full init from scratch.
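
As a rough, hedged sketch of the full re-init approach (the helper name deepspeed_init, its arguments, and the Trainer attributes touched here are assumptions for illustration, not the exact merged diff):

```python
# Hedged sketch: rebuild the DeepSpeed engine from scratch when loading the
# best model, instead of reusing stored (possibly stale/wrong) init args.
# `deepspeed_init` and the attribute names below are assumptions for
# illustration, not the exact code that was merged.
from transformers.deepspeed import deepspeed_init

def _load_best_model_with_deepspeed(trainer):
    # full re-init: fresh engine/optimizer/scheduler, resumed from the best
    # checkpoint, rather than patching an engine that has already stepped
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
        trainer,
        num_training_steps=trainer.args.max_steps,
        resume_from_checkpoint=trainer.state.best_model_checkpoint,
    )
    trainer.model = deepspeed_engine.module
    trainer.model_wrapped = deepspeed_engine
    trainer.deepspeed = deepspeed_engine
    trainer.optimizer = optimizer
    trainer.lr_scheduler = lr_scheduler
```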

There was also an issue on the DeepSpeed side with the model getting DeepSpeed hooks added multiple times, which was breaking everything. Fixed in deepspeedai/DeepSpeed#1947.

I spent many hours trying to reproduce the problem in the usual way via the example scripts to make a test, but alas, it just wouldn't fail in the right places. So I ended up re-implementing test_load_best_model using code derived from @base-y's repro example. Much appreciated having their script.
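
For context, a minimal sketch of what such a standalone load_best_model test can look like (the tiny model name, dataset handling, config file, and step counts below are illustrative assumptions, not the exact test that landed):

```python
# Illustrative sketch of exercising load_best_model_at_end under DeepSpeed.
# The model name, ds_config file, and step counts are assumptions.
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

def run_load_best_model_repro(train_dataset, eval_dataset, ds_config_path, output_dir):
    model = AutoModelForSequenceClassification.from_pretrained(
        "hf-internal-testing/tiny-random-bert"  # hypothetical tiny model
    )
    args = TrainingArguments(
        output_dir=output_dir,
        deepspeed=ds_config_path,        # e.g. a ZeRO-3 config, as in #17114
        evaluation_strategy="steps",
        eval_steps=5,
        save_steps=5,
        max_steps=10,
        load_best_model_at_end=True,     # the code path this PR fixes
        per_device_train_batch_size=2,
    )
    trainer = Trainer(
        model=model, args=args,
        train_dataset=train_dataset, eval_dataset=eval_dataset,
    )
    trainer.train()   # must not fail when the best model is re-loaded at the end
    return trainer.evaluate()
```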

Blocking events

Fixes: #17114

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented May 10, 2022

The documentation is not available anymore as the PR was closed or merged.

@dumpmemory
Contributor

When will this PR be merged?

@stas00
Contributor Author

stas00 commented Jun 2, 2022

We needed to wait for a new deepspeed release, which I see has happened already, so yes, we can merge this shortly.
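
As a hedged aside, the minimum-version requirement bumped in this PR (deepspeed>=0.6.5, per the commit list below) could be checked along these lines (a generic sketch, not the exact check transformers uses):

```python
# Sketch: verify the installed DeepSpeed carries the fix from
# deepspeedai/DeepSpeed#1947 by requiring deepspeed>=0.6.5.
import importlib.metadata
from packaging import version

def require_min_deepspeed(minimum: str = "0.6.5") -> None:
    installed = version.parse(importlib.metadata.version("deepspeed"))
    if installed < version.parse(minimum):
        raise ImportError(f"deepspeed>={minimum} is required, found {installed}")
```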

@stas00 stas00 changed the title from [WIP] [trainer/deepspeed] load_best_model to [trainer/deepspeed] load_best_model (reimplement re-init) Jun 2, 2022
@stas00 stas00 marked this pull request as ready for review June 2, 2022 15:51
@stas00 stas00 requested a review from sgugger June 2, 2022 15:51
Collaborator

@sgugger sgugger left a comment

LGTM!

@stas00 stas00 merged commit 2f59ad1 into main Jun 2, 2022
@stas00 stas00 deleted the ds-load_best_model branch June 2, 2022 16:14
Narsil pushed a commit to Narsil/transformers that referenced this pull request Jun 7, 2022
…e#17151)

* [trainer/deepspeed] load_best_model

* to sync with DS PR huggingface#1947

* simplify

* rework load_best_model test

* cleanup

* bump deepspeed>=0.6.5

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Jun 16, 2022


Development

Successfully merging this pull request may close these issues.

Error on loading saved optimizer after training (zero-3)
